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Abstract. We consider to learn a causal ordering of variables in a linear 
non-Gaussian acyclic model called LiNGAM. Several existing methods 
have been shown to consistently estimate a causal ordering assuming that 
all the model assumptions are correct. But, the estimation results could 
be distorted if some assumptions actually are violated. In this paper, we 
propose a new algorithm for learning causal orders that is robust against 
one typical violation of the model assumptions: latent confounders. We 
demonstrate the effectiveness of our method using artificial data. 
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1 Introduction 

Bayesian networks have been widely used to analyze causal relations of variables 
in many empirical sciences A common assumption is linear-Gaussianity. 
But this poses serious identifiability problems so that many important models 
are indistinguishable with no prior knowledge on the structures. Recently, it was 
shown [5] that use of non-Gaussianity allows the full structure of a linear acyclic 
model to be identified without pre-specifying any causal orders of variables. The 
new model, a Linear Non-Gaussian Acyclic Model called LiNGAM [9], is closely 
related to independent component analysis (ICA) [7]. 

Existing estimation methods [MTU] for LiNGAM learn causal orders assuming 
that all the model assumptions hold. Therefore, these algorithms could return 
completely wrong estimation results when some of the model assumptions is 
violated. Thus, in this paper, we propose a new algorithm for learning causal 
orders that is robust against one typical model violation, i.e., latent confounders. 
A latent confounder means a variable which is not observed but which exerts a 
causal influence on some of the observed variables. 

The paper is organized as follows. We first review LiNGAM [9j and its ex- 
tension to latent confounder cases [5] in Section [2] In Section [31 we propose 
a new algorithm to learn causal orders in LiNGAM with latent confounders. 
Simulations are conducted in Section [H We conclude this paper in Section [5| 
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2 Background: LiNGAM with latent confounders 

We briefly review a linear non-Gaussian acyclic model called LiNGAM [9] and 
an extension of the LiNGAM to cases with latent confounding variables [5] . 
In LiNGAM [5], causal relations of observed variables Xi are modeled as: 

k{j)<k(i) 

where k{i) is such a causal ordering of variables Xi that they graphically form 
a directed acyclic graph (DAG) so that no later variable determines, i.e., has 
a directed path on any earlier variable, are external influences, and bij are 
connection strengths. In matrix form, the model ([l} is written as 

X = Bx + e, (2) 

where the connection strength matrix B collects bij and the vectors x and e 
collect Xi and ej. Note that the matrix B can be permuted to be lower tri- 
angular with all zeros on the diagonal if simultaneous equal row and column 
permutations are made according to a causal ordering k{i) due to the acyclicity. 
The zero/non-zero pattern of bij corresponds to the absence/existence pattern 
of directed edges. External influences follow non- Gaussian continuous distri- 
butions with zero mean and non-zero variance and are mutually independent. 
The non-Gaussianity assumption on e,; enables identification of a causal ordering 
k(i) based on data x only [9|. This feature is a big advantage over conventional 
Bayesian networks based on the Gaussianity assumption on [11] . 

Next, LiNGAM with latent confounders |B] can be formulated as follows: 

x = Bx + Af + e, (3) 

where the difference with LiNGAM ^ is the existence of latent confounding 
variable vector f . A latent confounding variable is such an latent variable that is 
a parent of more than or equal to two observed variables. The vector f collects 
non-Gaussian latent confounders fj with zero mean and non-zero variance (j = 
1, • • • ,q). Without loss of generality j6|, latent confounders fj are assumed to be 
mutually independent. The matrix A collects Ay which denotes the connection 
strength from fj to Xi. For each j, at least two Ay are non-zero since a latent 
confounder is defined to have at least two children. Further, it is assumed [S] that 
correlation and conditional correlation of Xi, fi and arc entailed by the graph 
structure only, i.e., the zero/non-zero status of bij and Xij. This is a well-known 
assumption called faithfulness in causal discovery 

The central problem of causal discovery based on the latent variable LiNGAM 
in Equation is to estimate as many of causal orders fc(i) and connection 
strengths bij as possible based on data x only. This is because in many cases 
only an equivalence class of the true model whose members produce the exact 
same observed distribution is identifiable [B]. 

In [B] , an estimation method based on overcomplcte ICA was proposed. How- 
ever, overcomplete ICA methods are often not very reliable and get stuck in local 
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optima. Thus, in [5], a method that docs not use overcomplcte ICA was proposed 
to first find variable pairs that are not affected by latent confounders and then 
estimate a causal ordering of one to the other instead of a causal ordering of 
more than two variables. 

3 A hybrid estimation approach 

In this section, we propose a new approach for estimating causal orders of more 
than two variables without explicitly modeling latent confounders. We first pro- 
vide principles to identify such an exogenous (root) variable and a sink variable 
that are not affected by latent confounders in the latent variable LiNGAM in 
Equation ^ (if such variables exist) and next present an estimation algorithm. 
Recent estimation methods [Ellin] for LiNGAM in Equation and its non- 
linear extension [5] learn a causal ordering by finding causal orders one by one 
either from the top downward or from the bottom upward assuming no latent 
confounders. We extend these ideas to latent confounder cases. 

We first generalize Lemma 1 of [TUl for the case of latent confounders. 

Lemma 1 Assume that all the model assumptions of the latent variable LiNGAM 
{3^ fl'^e fnet and the sample size is infinite. Denote by rp'' the residuals when Xi 
are regressed on Xj: r|"'^ = .t,; — Xj (i ^ j). Then a variable Xj is an 

exogenous variable in the sense that it has no parent observed variable nor latent 
confounder if and only if Xj is independent of its residuals r'f^ for all i ^ j . □ 

Next, we generalize the idea of [5] for the case of latent confounders. 

Lemma 2 Assume that all the model assumptions of the latent variable LiNGAM 
(0) are met and the sample size is infinite. Denote by ^(-j) a vector that contains 

( — i) 

all the variables other than Xj. Denote by r^ the residual when Xj is regressed 



sense that it has no child observed variable nor latent confounder if and only if 



The proofs of these lemmas are given in the appendix. 

Thus, we can take a hybrid estimation approach that uses these two princi- 
ples. We first identify an exogenous variable by finding a variable that is most 
independent of its residuals and remove the effect of the exogenous variable 
from the other variables by regressing it out. We repeat this until independence 
of every variable and any of its residuals is statistically rejected. Dependency 
between every variable and any of its residuals implies that such an exogenous 
variable in Lemma [T] does not exist or some model assumption of latent vari- 
able LiNGAM is violated. Similarly, we next identify a sink variable in the 
remaining variables by finding a variable that its regressors and its residual are 



X(_j) is independent of its residual 



on ^(-j), i.e., rj ■'' = Xj cr 
covariance matrix of [xj , xT 




□ 



4 T. Tashiro, S. Shimizu, A. Hyvarinen and T. Washio 



most independent and disregard the sink variable. We repeat this until indepen- 
dence is statistically rejected for every variable. We test pairwise independence 
between variables and the residuals using a kernel-based independence measure 
called HSIC j4] and combine the resulting p-values using a well-known Fisher's 
method [3]. We use Bonferroni correction for multiple comparison dividing the 
significance level by the maximum number of tests p—1. 
Thus, the estimation consists of the following steps: 

1. Given a p-dimensional random vector x, a set of its variable subscripts 

a p X n data matrix of the random vector as X and a significance level 
a, initialize an ordered list of variables Khead '■= and Ktaii '■= and 
m := 1. Khead and Ktaii denote first \Khead\ orders of variables and last 
\Ktaii\ orders of variables respectively, where each of \Khead\ and \Ktaii\ 
denotes the number of elements in the list. 

2. Let x=x and X=X and find causal orders one by one from the top downward: 

(a) Do the following steps for all j G U\Khead'- Perform least squares regres- 
sions of Xi on Xj for all i G t/ \ Khead (i 7^ j) and compute the residual 
vectors f . Then, find a variable Xm that is most independent of its 
residuals: 

Xm = aig max Pptsherlij, f^^'), (4) 

where Ppisherixj^i^-'^) is the p- value of the test statistic defined as 
-2 log{-Pi/(5j, rp^)}, where Pi/(ij, fp^) is the p-value of the HSIC. 

(b) Go to StepElif PK.?.er(im,f^™') < 1). 

(c) Append m to the end of Khead and let x := f'™) and X := R(™'. If 
\Khead\ = P ~ 1, append the remaining variable subscript to the end of 
Khead and go to StepHl Otherwise, go back to Step (Pa|) . 

3. If \Khead\ < p - 2II let x' = X and X' = X and U' -.^ U \ Khead and find 
causal orders one by one from the bottom upward: 

(a) Do the following steps for all j £ U' \ Ktaii- Collect all the variables 
except x'j in a vector x'^^j)- Perform least squares regressions of x'j on 

x'^_^-j and compute the residual r'j Then, find such a variable x'^ 
that its regressors and its residual are most independent: 

x'm = a.i-g max PFisher{^'(^.),r'\ ^^). (5) 

(b) Go to StepHif PF,sher{^^^_m)yf^) < a/{p - 1). 

(c) Append m to the top of Ktaii and let x' = x'^_^^^X' = X'^_^^. Go to 
Step H if \U' \ Ktatil < 3.3 Otherwise go back to Step ([5a)). 

4. Estimate connection strengths bij for variables in Khead and Ktaii by doing 
multiple regression of every variable Xi in Khead and Ktaii on all of its non- 
descendants Xj with k{j) < k{i). 



^ We do not examine remaining two variables in Step [3] since it is already implied in 
Step [2] that some latent confounders exist or some model assumption is violated. 
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Fig. 1. True network used in the simulation. The variables /i and /2 were latent 
confounders. The green contours include variables that share /i or /2. The external 
influences e; are omitted to be shown. 



Note that our algorithm would output no causal orders in cases that such ex- 
ogenous variables and sink variables as in Lemmas [T] and [5] do not exist, although 
the outputs are still correct. One way to learn more causal orders in those cases 
would be to develop a divide-and-conquer algorithm that divides variables into 
subsets where such exogenous or sink variables exist and integrates the estima- 
tion results on the subsets. This is an important direction of future research. 



4 Experiments on artificial data 

We compared our method with an estimation method for LiNGAM ([2]) called 
DirectLiNGAM [TU] that does not allow latent confounders and an estimation 
method for latent variable LiNGAM dSj called Pairwise LvLiNGAM [2]. If there 
is no latent confounders, all the methods should estimate correct causal orders for 
enough large sample sizes. The number of variables was 6, and the sample sizes 
tested were 500, 1000, 2000. The original network used was shown in Figure [TJ 
The ei, 64 and /i followed a multimodal asymmetric mixture of two Gaussians, 
e2,e5,/2 followed a double exponential distribution, and 63 and eg followed a 
multimodal symmetric mixture of two Gaussians. The standard deviations of 
the Ci were set so that their signal-to-noise ratios, i.e., var(2;i)/var(ei) — 1 were 
all ones. The number of trials was 100. The significance level a was 0.05. 

First, to evaluate performance of estimating causal orders fc(i), we computed 
the percentage of correctly estimated causal orders in estimated causal orders be- 
tween two variables (Precision) and the percentage of correctly estimated causal 
orders in actual causal orders between two variables that share no latent con- 
founders in the true data generating network (Recall). The reason why only 
pairwise causal orders were evaluated was that Pairwise LvLiNGAM only es- 
timates causal orders of two variables unlike our method and DirectLiNGAM. 
Tables [T] and [2] show the results. Regarding precisions, our method was com- 
parable to Pairwise LvLiNGAM and the two methods were much better than 
DirectLiNGAM for all the conditions. Regarding recalls, our method was better 
than both DirectLiNGAM and Pairwise LvLiNGAM for all the conditions. 



6 T. Tashiro, S. Shimizu, A. Hyvarinen and T. Washio 



Table 1. Precisions 



Table 2. Recalls 



Sample size 



Sample size 



500 1000 2000 



500 1000 2000 



Our method 
DirectLiNGAM 
Pairwise LvLiNGAM 



0.78 0.80 0.80 
0.65 0.64 0.64 
0.79 0.81 0.81 



Our method 
DirectLiNGAM 
Pairwise LvLiNGAM 



0.97 0.99 0.99 
0.81 0.80 0.81 
0.86 0.89 0.90 



Next, to evaluate the performance in estimating connection strengths bij, we 
computed the root mean square errors between true connection strengths and 
estimated ones. The root mean square errors for our method and DirectLiNGAM 
were 0.079 and 0.090 for 500 data points, 0.070 and 0.079 for 1000 data points 
and 0.015 and 0.057 for 2000 data points, respectively, where our method was 
more accurate. Note that Pairwise LvLiNGAM does not estimate bij. 

5 Conclusions 

We proposed a new algorithm for learning causal orders, which is robust against 
latent confounders. In experiments on artificial data, our approach learned more 
causal orders accurately than two existing methods. In future work, we would like 
to test our method on real-world data including functional magnetic resonance 
imaging data to analyze causal interactions between brain regions. 
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Appendix: Proofs of the lemmas 

Theorem 1 (Darmois-Skitovitch theorem (D-S theorem) [1]) Define two 
random variables yi and j/2 o,s linear combinations of independent random vari- 
ables Si(i^l, q): yi = J2i=i'^i^f 2/2 = Z^LiAsi- Then, if yi and y2 are 
independent, all variables Sj for which OLjj3j ^ are Gaussian. □ 

In other words, this theorem means that if there exists a non-Gaussian Sj for 
which ajjSj^Q, yi and y2 are dependent. 

Further, Lemma 3 of [2] has shown that the regressor and its residual in simple 
linear regression are dependent if there are some latent confounders between the 
regressor and regressand in the latent variable LiNGAM ([3]). 

Proof of Lemma[T] i) Assume that Xj has at least one parent observed variable 
or latent confounder. Let Pj denote the set of the parent variables of Xj . Then one 
can write Xj=J2pi^£p- WjhPh+ej, where the parent variables ph are independent 
of Cj and the coefficients Wjh are non-zero. Suppose that Xi is a parent of Xj. For 

such x,, we have = x, - -;±^Xj ^ x, - -^;t{^UZp^eP, ^JhPh + ej) = 

[-, wj,cov{x,,Xj) \ cov{x,,Xj) sr^ cov{x,,Xj) Farh nf thnsp 

parent variables (including a;^) in Pj is a linear combination of external influences 
other than Cj and latent confounders that arc non-Gaussian and independent. 

Thus, the rY' and Xj can be written as linear combinations of non-Gaussian and 
independent external influences including Cj and latent confounders. Further, the 
coefficient of ej on rp"* is non-zero since cov{xi,Xj) ^ due to the faithfulness 
and that on Xj is one by definition. These imply that r^p and Xj are dependent 
since , Xj and Cj correspond to y\, 7/2, Sj in D-S theorem, respectively. Next, 
for the other case that Xj has a latent confounder, r^-""'^ and an observed variable 
can be shown to be dependent using Lemma 3 of [2] since by definition at least 
one observed variable shares the latent confounder with Xj. 

ii) The converse of contrapositive of i) is straightforward using the model 
definition. From i) and ii), the lemma is proven. ■ 
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Proof of Lemma[2] i) Assume that a variable xj has at least one child observed 
variable or latent confounder. First, without loss of generality, one can write 



-3) 



= (I - B)-i (Af + e) = A(Af + e) 



Ajf 



(6) 
(7) 



where each of A (= (I — B)"-'^) and A(_j) is invertible and can be permuted to 
be a lower triangular matrix with the diagonal elements being ones if the rows 
and columns are simultaneously permuted according to the causal ordering k{i). 
The same applies to the inverse of A: 



where D 



A-i = 



H-j) 



D 



1^ 



Thus, 1 - 



-1 -a^, .D- 
D 1 



(-jb-^(-j)^(-i) 
Ajf + ej + aJ(_^-)(A(_j)f + e(_j-)) 



1. Then, 



-{1- 



(Ajf + e,)+A(_,)(A(_,)f + e(_,)) 



a:,_,,A(_j-) - crf_^-)^-Z'(J-^.)(a(_j)jAj + A(_j) A(_j))}f 



In Equation PT|) . if a^. 



0"^, then we have 



(8) 



(9) 



(10) 



)-(ll) 



{Xj{l 
= Ajf + 



+ - ^i-3)\-j)''i~3)3}^3 (12) 

(13) 



Thus, the coefhcient of Cj on is one. Now, suppose that Xj has a child Xi. 

The coefficient of on Xi is non-zero due to the faithfulness. Thus, and Xi 
are dependent due to D-S theorem. Next, suppose that Xj has a latent confounder 
fi. Then, in Equation pip , the corresponding element in Xj is not zero, i.e., the 
coefficient of fi on J'j is not zero. Further, fi has a non-zero coefficient on 



at least one variable in x 



(-i) 



due to the definition of latent confounders and 



faithfulness. Therefore, 



and 



are dependent due to D-S theorem. 



H-j) 



^0' 



" is not zero. By 



On the other hand, in Equation pT|) . if — 

at least one of the coefficients of the elements in e(_j) , ^ 
definition, every element in e(_j) has a non-zero coefficient on the corresponding 

element in ^(-j). Thus, fj""''' and X(_j) are dependent due to D-S theorem. 

ii) The converse of contrapositive of i) is straightforward using the model 
definition. From i) and ii), the lemma is proven. ■ 



